In the second lecture, we learned that JSDMs relax the assumption of species’ independence. This is one of the more critical assumptions made in VGLMMs, that is ecologically unrealistic. It is unrealistic, because we often expect species to co-occur, which results in positive correlation, or the opposite of co-occurring (avoidance) results in negative correlation between species.
These (residual) correlations are difficult to interpret. They can be caused by species interactions, but are confounded with all other sources of unmeasured variation. For example, if we forget to include an important covariate in the model, this will also result in residual correlations. That makes that the associations are useful for improving prediction, but perhaps not so much for inference?
Technically, JSDMs are models for binary data: presence-absence of
species. However, we can generalize this to other data types and
responses by keeping the model on the link-scale the same. In the gllvm
R-package it is pretty straightforward: we just change the
family argument to (for example) “Poisson”, “ordinal”,
“beta” or “tweedie” (see ?gllvm). In this exercise, we will
fit JSDMs to …
We will explore fitting JSDMs using a binary dataset of alpine plants by D’Amen et al. (2018) , used to demonstrate GLLVMs by van der Veen et al. (2021). The data is available in the github repository as “Alpine”, in the data folder. We can read it in as follows:
Y <- read.csv("../../data/alpineY.csv")[,-1]
X <- read.csv("../../data/alpineX.csv")[,-1]
X <- data.frame(lapply(X, function(x)if(is.numeric(x)){scale(x)}else{as.factor(x)}))
you might have to change the exact working directory to make it work for your exact set-up. A good place to start an analysis, is to first examine the data a bit more, so we know what we are dealing with:
dim(Y)
## [1] 912 175
colnames(X)
## [1] "X" "Y" "DDEG0" "SLOPE" "MIND" "SOLRAD" "TPI"
The data are presence-absences, and there are 912 rows and 175 columns (species). Because we will be modelling species-specific responses, we should ensure that each species has enough observations in the data. There area large number of rows, and although the method can technically accommodate sites without any observations, we can speed things up a little by removing the ones that are empty.
min(colSums(Y))
## [1] 22
table(rowSums(ifelse(Y==0,0,1))>3)
##
## FALSE TRUE
## 142 770
From the 175 species, all have at least 22 observations. This is because D’Amen et al. already filtered the species and removed the ones with fewer than 22 presences. Removing species with few presences is a personal decision: it can considerably speed up model fitting and improve convergence (parameters for species with little data are often difficult to estimate), but of course we lose vital information. From the perspective of ordination, as covered in the lecture, we might want to retain species with few observations when they add vital information about the range or limits of the ecological gradient. However, here we are fitting JSDMs, and we take a different angle: the parameters of species with few observations cannot be accurately estimated anyway, so we might as well get rid of them!
We also see that 72 rows in the data have no information, so we get rid of those.
X <- X[rowSums(Y)>0, ]
Y <- Y[rowSums(Y)>0,]
The modeling goes as before, with the gllvm function.
However, now we also use num.lv, which stands for the
number of latent variables added to the model. Here, we fit JSDMs using
the “factor-analytic” approach, or latent variable modeling. Few latent
variables makes for a fast model, but potentially poor estimation of the
correlation of species. So, we are left with trade-off: wait for a long
time for an accurate estimate, or quickly get something slightly less
accurate.
A good place to start, is two latent variables (the default). We can
also add covariates, or random effects, to the model, but we will start
without. We will use a few arguments to speed up the model, as well as
fitting the model in parallel (TIP: you might want to
open a task manager os ystem monitor to keep an eye on your computer’s
resources). The argument Lambda.struc simplifies the
approximation to the likelihood and should be used cautiously, the
argument sd.errors turns off calculation of standard errors
as that can take longer than the actual model fitting at times,
optim.method selects the numerical optimisation algorithm,
because “L-BFGS-B” is often much faster with parallelisation than the
default “BFGS”. All together, this significantly reduces the time to fit
the model from tens of minutes to about half a minute (with 7 CPU). If
you work on a laptop, ensure that your battery settings are not set to
“balanced”, as this too will slow model fitting.
library(gllvm)
TMB::openmp(parallel::detectCores()-1, DLL = "gllvm", autopar = TRUE)
## gllvm
## 7
## attr(,"autopar")
## [1] TRUE
model1 <- gllvm(y = Y, num.lv = 2, family = "binomial", Lambda.struc = "diagonal", sd.errors = FALSE, optim.method = "L-BFGS-B")
Turning off standard error calculation is often a good idea when in
gllvm, when working with large datasets or complex models. When we have
decided on our “final” model, we can post-hoc calculate the standard
errors using the se.gllvm function (an example is shown on
the help page of that function). For now, let’s visualize the estimated
associations:
## gllvm
## 7
## attr(,"autopar")
## [1] TRUE
corrplot::corrplot(getResidualCor(model1), type = "lower", order = "AOE", diag = FALSE, tl.pos = "l", tl.cex = 0.2, addgrid.col = NA)
The corrplot package helps us to order the plot so we might identify
patterns. As you might realize at this point, it is -very- difficult to
make ecological sense of these associations (partly because there are so
many, and now we are “only” working with 175 species).
There are two directions we can take this now: add species-specific
effects (fixed or random), or species-common effects (fixed or random),
or try to go through a procedure of selecting the optimal number of
latent variables to represent the associations, with AIC,
BIC, goodnessOfFit.
?goodnessOfFit helps you to calculate some metrics for the
predictive performance of the model, such as Tjur’s \(R^2\):
goodnessOfFit(Y, object = model1, measure = "TjurR2")
## $TjurR2
## [1] 0.3509823
this is used in distribution modeling to quantify “discriminative power”; if observations (presence or absence) can perfectly be classified by the model, it will equal 1. With more latent variables, or by adding covariates, it will probably improve.
The data includes longitude and latitude columns, as it is specially explicit. The package offers the options for incorporating spatial random row effects to account for autocorrelation that may exist, but these are not particularly fast to implement (yet) and is a work in progress. I do not recommend doing it in this exercise, the following code serves as reference in case you need it for your analysis:
model2 <- gllvm(y = Y, num.lv = 2, family = "binomial", Lambda.struc = "diagonal", sd.errors = FALSE, optim.method = "L-BFGS-B", row.eff = ~corExp(1|site), studyDesign = data.frame(site = 1:nrow(Y)), dist = list(as.matrix(X[,c("X","Y")])))
The goal is to get familiar with Joint Species Distribution Modeling, and the output that the gllvm R-package provides in that context.
Here is what I want you to do:
corrplot functiongoodnessOfFit to try
and find the model with the highest discriminative powercoefplot, randomCoefPlot,
summary to try and draw conclusions about the main drivers
of species’ presence for this datasetIf there is enough time, we will discuss as a group.
Please be advised that the models 4 and 5 in this part are considerably more complex and take time to fit.
The above JSDM does not incorporate traits, or phylogeny. Both of these can help to improve our prediction of species’ distribution. The alpine plant dataset does not include either, so we will use a different dataset that does, for this second part. It is included in the package already, which makes loading it a bit easier.
data(fungi, package = "gllvm")
Y2 <- fungi$Y
X2 <- fungi$X
X2 <- data.frame(lapply(X2, function(x)if(is.numeric(x)){scale(x)}else{as.factor(x)}))
tree <- fungi$tree
covMat<- ape::vcv(tree)
distMat <- ape::vcv(tree)
TR <- fungi$TR
colnames(TR)[8] <- "Sp.log.vol" # funky column name needs to be changed
We again explore the data a little:
any(rowSums(Y2)==0)
## [1] FALSE
dim(Y2)
## [1] 1666 215
colnames(X2)
## [1] "DBH.CM" "AVERDP" "CONNECT10" "TEMPR" "PRECIP" "log.AREA"
## [7] "REGION" "RESERVE"
colnames(TR)
## [1] "PC1" "PC2" "PC3" "PC1.1" "PC1.2"
## [6] "PC1.3" "FB.type" "Sp.log.vol" "Lifestyle"
This a binary dataset by Abrego
et al. (2022), of 215 wood inhabiting fungi inhabiting 1666 logs.
TR include the trait covariates, and X
includes environmental variables related to the forest, or the deadwood,
that the fungi occurred in/on. First, we will fit a model without
phylogeny but with traits, and including “REGION” and “RESERVE” as
random row effects.
model3 <- gllvm(Y2, X = X2, TR = TR, formula = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP+
(DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP):(PC1.1+PC1.2+PC1.3),
row.eff = ~(1|REGION/RESERVE), studyDesign = X2[,c("REGION","RESERVE")],
num.lv = 0, family = "binomial", sd.errors = FALSE,
optim.method = "L-BFGS-B")
This model does not include species-specific random effects yet, so should fit pretty quickly. The included “traits” are Principal Components extracted from the original trait matrix, in order to a-priori reduce complexity of the model. Be careful when playing around: with these dataset dimensions the models start using loads of RAM, and when you run out of RAM R will crash, which can quickly happen when we add more traits. In general, when working with large datasets and complex models, using a server for more computing power is adviseable.
We can examine the fourth corner coefficients:
library(lattice)
fourth <- model3$fourth.corner # the coefficients
a <- max(abs(range(fourth)))
colort <- colorRampPalette(c("blue", "white", "red"))
plot.4th <- levelplot((as.matrix(fourth)), xlab = "Environmental Variables",
ylab = "Species traits", col.regions = colort(100), cex.lab = 1.3,
at = seq(-a, a, length = 100), scales = list(x = list(rot = 45)))
plot.4th
Excellent! The trait-environment interaction coefficients have been
included, and visualized. You can also examine them using the model its
summary. With this model we assumed that species’ responses
to the environment are fully determined by species’ traits. Usually, we
do not know if we have measured the right traits, so incorporating
“residual” information in species’ responses to the environmental
covariates is vital. We relax our assumption by fitting another model
(this might take a little):
model4 <- gllvm(Y2, X = X2, TR = TR, formula = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP+
(DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP):(PC1.1+PC1.2+PC1.3),
randomX = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP,
row.eff = ~(1|REGION/RESERVE), studyDesign = X2[,c("REGION","RESERVE")],
num.lv = 0, family = "binomial", sd.errors = FALSE,
optim.method = "L-BFGS-B", Ab.struct = "diagonal", maxit=1e5)
this formulation ensures that the relevant correlations between
species’ random effects are incorporated, which we can also examine with
summary. Before we do that, we will add the final component
to our model; the phylogeny. The gllvm package broadly implements
phylogenetic mixed-effects models, which can also be used to fit models
for traits or individuals (i.e., where traits are the response
variables), but here we focus on our species data. Phylogenetic random
effects make the model considerably more computer intensive (complex),
and there are a range of considerations that we need to make when
fitting it. In particular, the nearest neighbour approximation and the
ordering of the species. There are various ways to find the “optimal”
setting for both those things, which we will not go further into here
(but ask if you want to know!). For further details see the
corresponding vignette in the package. For the following models, you
will need at least 30 minutes computing time. Mind yourself, this is
incredibly fast; it took the original authors ten days with the Hmsc
R-package. For the exercises, you can reduce this further by (e.g.,)
subsetting the data and fitting the models to a smaller number of
species and sites.
e <- eigen(covMat)$vectors[,1]
ord <- gllvm:::findOrder(covMat = covMat, distMat = distMat, nn = 15, order = order(e))$order
spec.ord <- colnames(covMat)[ord]
model5 <- gllvm(Y2[,spec.ord], X = X2, TR = TR, formula = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP+
(DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP):(PC1.1+PC1.2+PC1.3),
randomX = ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP,
row.eff = ~(1|REGION/RESERVE), studyDesign = X2[,c("REGION","RESERVE")],
num.lv = 0, family = "binomial", sd.errors = FALSE,
optim.method = "L-BFGS-B", Ab.struct = "MNdiagonal", maxit = 1e5,
colMat = list(covMat[spec.ord, spec.ord], dist = distMat[spec.ord, spec.ord]), colMat.rho.struct = "term", nn.colMat = 15)
Let’s also calculate the standard error for our “final” model:
ses <- se(model5)
model5$sd <- ses$sd
model5$Hess <- ses$Hess
We can examine the phylogenetic signal with summary and
plot the species-specific random effects (without trait effects) jointly
with the phylogeny. We first need to calculate standard errors to do
that:
summary(model5)
##
## Call:
## gllvm(y = Y2[, spec.ord], X = X2, TR = TR, formula = ~DBH.CM +
## AVERDP + CONNECT10 + TEMPR + PRECIP + (DBH.CM + AVERDP +
## CONNECT10 + TEMPR + PRECIP):(PC1.1 + PC1.2 + PC1.3), family = "binomial",
## num.lv = 0, studyDesign = X2[, c("REGION", "RESERVE")], colMat = list(covMat[spec.ord,
## spec.ord], dist = distMat[spec.ord, spec.ord]), colMat.rho.struct = "term",
## row.eff = ~(1 | REGION/RESERVE), sd.errors = FALSE, randomX = ~DBH.CM +
## AVERDP + CONNECT10 + TEMPR + PRECIP, optim.method = "L-BFGS-B",
## Ab.struct = "MNdiagonal", maxit = 1e+05, nn.colMat = 15)
##
## Family: binomial
##
## AIC: 104177.6 AICc: 104178 BIC: 106950.3 LL: -51832 df: 257
##
## Informed LVs: 0
## Constrained LVs: 0
## Unconstrained LVs: 0
##
## Formula: ~DBH.CM+AVERDP+CONNECT10+TEMPR+PRECIP+DBH.CM:PC1.1+DBH.CM:PC1.2+DBH.CM:PC1.3+AVERDP:PC1.1+AVERDP:PC1.2+AVERDP:PC1.3+CONNECT10:PC1.1+CONNECT10:PC1.2+CONNECT10:PC1.3+TEMPR:PC1.1+TEMPR:PC1.2+TEMPR:PC1.3+PRECIP:PC1.1+PRECIP:PC1.2+PRECIP:PC1.3
## LV formula: ~ 0
## Row effect: ~(1 | REGION/RESERVE)
##
## Random effects:
## Name Signal Variance Std.Dev Corr
## DBH.CM 0.0159 0.0021 0.0453
## AVERDP 1.0000 0.1522 0.3901 0.0021
## CONNECT10 0.0000 0.0127 0.1127 -0.0274 0.0075
## TEMPR 0.8186 0.0617 0.2484 -0.3826 -0.0298 0.8912
## PRECIP 0.7764 0.0318 0.1784 0.7568 -0.3552 -0.2760 -0.5933
##
## Coefficients predictors:
## Estimate Std. Error z value Pr(>|z|)
## DBH.CM 0.148190 0.012541 11.817 < 2e-16 ***
## AVERDP -0.324755 0.089667 -3.622 0.000293 ***
## CONNECT10 -0.031708 0.033889 -0.936 0.349449
## TEMPR 0.187505 0.067485 2.778 0.005462 **
## PRECIP -0.008682 0.059848 -0.145 0.884658
## DBH.CM:PC1.1 -0.005848 0.006475 -0.903 0.366405
## DBH.CM:PC1.2 0.009325 0.006187 1.507 0.131761
## DBH.CM:PC1.3 0.008051 0.006488 1.241 0.214674
## AVERDP:PC1.1 -0.044217 0.021957 -2.014 0.044025 *
## AVERDP:PC1.2 0.021316 0.019199 1.110 0.266873
## AVERDP:PC1.3 -0.055204 0.017764 -3.108 0.001886 **
## CONNECT10:PC1.1 -0.005468 0.012011 -0.455 0.648899
## CONNECT10:PC1.2 -0.002231 0.011754 -0.190 0.849453
## CONNECT10:PC1.3 0.010513 0.012002 0.876 0.381059
## TEMPR:PC1.1 0.036291 0.017902 2.027 0.042639 *
## TEMPR:PC1.2 0.014528 0.015413 0.943 0.345901
## TEMPR:PC1.3 0.005430 0.016209 0.335 0.737653
## PRECIP:PC1.1 0.008589 0.014090 0.610 0.542145
## PRECIP:PC1.2 0.004515 0.012774 0.353 0.723760
## PRECIP:PC1.3 0.030275 0.012997 2.329 0.019837 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
phyloplot(model5, tree = tree)
and plot the associations due to the environmental variables:
# There is a known bug in the package, this next line addresses that
model5$col.eff$colMat <- cov2cor(covMat)
corrplot::corrplot(getEnvironCor(model5), type = "lower", order = "AOE", diag = FALSE, tl.pos = "l", tl.cex = 0.2, addgrid.col = NA)